1 Executive Summary

This report analyzes 46 intimate partner violence (IPV) detection experiments using Large Language Models, covering 14,790 narratives. The analysis shows strong overall accuracy (93.1%) but identifies recall (64.2%) as the critical area for improvement.

Key Performance Indicators:

  • Total Experiments: 46 (38 completed, 82.6% success rate)
  • Narratives Analyzed: 14,790
  • Average Accuracy: 93.1% ✅
  • Average Precision: 77.9%
  • Average Recall: 64.2% ⚠️ Needs Improvement
  • Average F1 Score: 0.681

🎯 Main Finding: The model exhibits a high false negative rate, meaning it misses actual IPV cases. This is the primary area requiring attention through prompt engineering and threshold adjustment.


2 Experiment Overview

2.1 Experiment Status

2.2 Model Distribution

The primary model mlx-community/gpt-oss-120b was used in 36 experiments (94.7% of completed runs).


3 Performance Analysis

3.1 Overall Metrics

⚠️ Performance Gap Identified:

The recall score of 64.2% is significantly lower than accuracy (93.1%), indicating the model misses approximately 35.8% of actual IPV cases. This is a critical issue for a detection system where false negatives have serious consequences.

3.2 Confusion Matrix Analysis

Confusion Matrix Breakdown:

  • True Positives: 1,129 (7.6%) - Correctly identified IPV
  • True Negatives: 12,612 (85.3%) - Correctly identified non-IPV
  • False Positives: 420 (2.8%) - Incorrectly flagged as IPV
  • False Negatives: 629 (4.3%) - Missed IPV cases ⚠️

Error Analysis:

  • False Positive Rate: 27.1% of positive predictions are incorrect
  • False Negative Rate: 35.8% of actual IPV cases are missed
  • Total Error Rate: 7.1%
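The error rates above follow directly from the confusion-matrix counts in Section 3.2 and can be re-derived as a sanity check (note that the pooled recall matches the 64.2% average, while pooled precision differs slightly from the 77.9% per-experiment average):

```python
# Re-derive the Section 3.2 error rates from the raw confusion-matrix counts.
tp, tn, fp, fn = 1129, 12612, 420, 629

total = tp + tn + fp + fn                 # 14,790 narratives
false_positive_rate = fp / (tp + fp)      # share of positive predictions that are wrong
false_negative_rate = fn / (tp + fn)      # share of actual IPV cases that are missed
total_error_rate = (fp + fn) / total

print(f"FP rate={false_positive_rate:.1%}  "
      f"FN rate={false_negative_rate:.1%}  "
      f"total error={total_error_rate:.1%}")
```

Running this reproduces the 27.1%, 35.8%, and 7.1% figures reported above.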

4 Prompt Engineering Analysis

4.1 Prompt Version Performance

The best performing prompt version is v0.3.2_indicators with an F1 score of 0.784.

4.2 Prompt Characteristics

Average prompt length: 2,636 characters


5 Temperature & Configuration

5.1 Temperature Impact on Performance

📊 Optimal Configuration: Temperature range 0.1-0.25 achieves the best F1 score of 0.687.
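The bucket comparison behind this finding can be sketched as follows; the (temperature, F1) pairs below are invented for illustration, as the real values live in the experiments table:

```python
from statistics import mean

# Hypothetical per-experiment records: (temperature, F1).
runs = [(0.0, 0.61), (0.1, 0.69), (0.1, 0.71), (0.2, 0.68), (0.2, 0.67), (0.7, 0.58)]

def bucket(t: float) -> str:
    """Group temperatures into the report's 0.1-0.25 band vs. single values."""
    return "0.1-0.25" if 0.1 <= t <= 0.25 else f"{t:g}"

by_bucket: dict[str, list[float]] = {}
for t, f1 in runs:
    by_bucket.setdefault(bucket(t), []).append(f1)

# Pick the temperature bucket with the best mean F1.
best = max(by_bucket.items(), key=lambda kv: mean(kv[1]))
print(best[0], round(mean(best[1]), 3))
```

With the real per-experiment F1 scores substituted in, the same grouping yields the 0.687 figure quoted above.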


6 Efficiency Analysis

6.1 Runtime Performance

Runtime Statistics:

  • Total Compute Time: 21.7 hours
  • Average Experiment Duration: 34.2 minutes
  • Average Time per Narrative: 5.24 seconds
  • Range: 1.87 - 7.68 seconds
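The runtime figures are mutually consistent: pooling total compute time over all narratives gives roughly the reported per-narrative average (the small gap comes from averaging per experiment rather than pooling):

```python
# Cross-check the runtime statistics reported above.
total_hours = 21.7
narratives = 14_790

seconds_per_narrative = total_hours * 3600 / narratives   # pooled, vs. 5.24 s averaged
narratives_per_hour = narratives / total_hours

print(f"{seconds_per_narrative:.2f} s/narrative, "
      f"{narratives_per_hour:.0f} narratives/hour")
```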


7 Top Performers & Problem Cases

7.1 Best Experiments

7.2 Problem Experiments


8 Recommendations

8.1 🎯 Optimal Configuration

Best Performing Setup:

  • Model: mlx-community/gpt-oss-120b
  • Temperature: 0.2
  • Prompt Version: v0.3.2_indicators
  • F1 Score: 0.808
  • Accuracy: 95.0%
  • Recall: 87.5%

This configuration should serve as the baseline for future experiments.

8.2 Priority Actions

8.2.1 Address Low Recall (Critical)

The 64.2% recall rate means 35.8% of actual IPV cases are missed.

Immediate Actions:

  • Expand IPV indicator examples in prompts (especially subtle cases)
  • Add more diverse relationship violence scenarios
  • Consider lowering the confidence threshold for positive detection
  • Review false negative cases for pattern identification
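One of the immediate actions, lowering the confidence threshold, can be sketched in isolation. The scores and labels below are invented for illustration; real scores would come from the model's confidence output:

```python
# Illustrative only: lowering the positive-detection threshold raises recall
# (at the cost of precision). Made-up model confidence scores and gold labels.
scores = [0.95, 0.80, 0.62, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]   # 1 = actual IPV

def recall_at(threshold: float) -> float:
    """Recall when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fn)

for t in (0.7, 0.5, 0.3):
    print(f"threshold={t}: recall={recall_at(t):.2f}")
```

On this toy data, recall climbs from 0.50 to 1.00 as the threshold drops from 0.7 to 0.3; the same trade-off would need to be measured against precision on the real narratives.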

8.2.2 Prompt Engineering Improvements

  • Reuse features from successful prompts, notably the addition of worked examples
  • Optimal temperature: Use 0.1-0.25 range (F1: 0.687)
  • Test variations with different indicator emphasis
  • A/B test instruction formats (imperative vs. question-based)

8.2.3 Future Experiments

Priority testing queue:

  1. Temperature exploration: Currently tested [0, 0.1, 0.2]
    • Try: 0.3, 0.7 if not yet tested
  2. Prompt variations:
    • With/without examples comparison
    • Different indicator ordering
    • Varied context windows
  3. Model comparison:
    • Test other available models
  4. Validation:
    • Run best config 3-5 times with different seeds
    • Cross-validate with held-out data

8.2.4 Monitoring & Quality

  • Set up automated performance tracking
  • Define acceptable F1 threshold (recommend ≥ 0.75)
  • Monitor drift in false negative rate
  • Regular prompt review cycles
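A minimal sketch of the automated check proposed above: flag any run whose F1 falls below the recommended 0.75 floor or whose false-negative rate drifts past a ceiling. The F1 floor is the report's recommendation; the FN-rate ceiling is an assumed value to be tuned against historical runs:

```python
# Automated quality gate for completed experiment runs.
F1_FLOOR = 0.75          # recommended minimum F1 (from this report)
FN_RATE_CEILING = 0.30   # assumed drift ceiling; tune against historical runs

def quality_flags(f1: float, fn_rate: float) -> list[str]:
    """Return a list of human-readable warnings; empty means the run passes."""
    flags = []
    if f1 < F1_FLOOR:
        flags.append(f"F1 {f1:.3f} below floor {F1_FLOOR}")
    if fn_rate > FN_RATE_CEILING:
        flags.append(f"FN rate {fn_rate:.1%} above ceiling {FN_RATE_CEILING:.0%}")
    return flags

print(quality_flags(0.681, 0.358))   # current averages trip both checks
print(quality_flags(0.808, 0.125))   # best-config numbers pass
```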

9 Data Quality Assessment

Data Completeness:

  • Experiments with Metrics: 38/46 (82.6%)
  • Experiments with Confusion Matrix: 38/46 (82.6%)
  • Experiments with Runtime Data: 38/46 (82.6%)

9.1 Issues & Warnings

⚠️ 2 Failed Experiments - Review error logs for root cause

⚠️ 3 Stalled Experiments - May need manual intervention

ℹ️ 3 Running Experiments - Wait for completion before final analysis

⚠️ 3 Accuracy Outliers Detected (>2σ from mean)

  • Test qwen/qwen3-next-80b new prompt 2025-10-03: 78.0%
  • Test qwen/qwen3-next-80b with modified prompt 2025-10-03: 84.7%
  • Test GPT-OSS-120B Baseline: 100.0%
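The >2σ screen behind these warnings can be sketched as follows, on made-up per-experiment accuracies (the real ones come from the experiments table). Note that with a small sample the outliers themselves inflate σ, so only the most extreme run trips the screen here, whereas the full data flagged three:

```python
from statistics import mean, stdev

# Hypothetical per-experiment accuracies; flag runs more than 2 sigma from the mean.
accuracies = [0.93, 0.94, 0.95, 0.92, 0.93, 0.94, 0.78, 0.847, 1.00, 0.93, 0.94, 0.95]

mu, sigma = mean(accuracies), stdev(accuracies)
outliers = [a for a in accuracies if abs(a - mu) > 2 * sigma]
print(outliers)
```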

10 Conclusions

10.1 Key Findings

10.1.1 Performance Assessment

With an average F1 score of 0.681, the system demonstrates moderate performance. However, the 64.2% recall rate indicates significant room for improvement in detecting actual IPV cases.

10.1.2 Consistency

Consistency: MODERATE (F1 σ = 0.086)

This moderate spread across experiments warrants investigating its sources, such as temperature and prompt-version differences.

10.1.3 Error Pattern

Error Profile: SKEWED TOWARD FALSE NEGATIVES

False negatives (629) outnumber false positives (420), consistent with the recall gap identified in Section 3.

10.2 Next Steps

  1. Immediate: Implement recall improvement strategies from Section 8.2.1
  2. Short-term: Run validation experiments with best configuration
  3. Medium-term: Expand prompt testing with identified optimal features
  4. Long-term: Establish continuous monitoring and improvement pipeline
  5. Documentation: Document best practices based on findings

Expected Impact of Recommendations:

If recall can be improved to 80% while maintaining current precision:

  • New F1 Score: ~0.78 (current: 0.681)
  • Reduction in missed cases: ~40%
  • Overall system reliability: Substantially improved
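The projected F1 is a straightforward harmonic-mean computation, assuming precision holds at the 77.9% average while recall improves to 80%:

```python
# Worked check of the ~0.78 projection above.
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

projected = f1(0.779, 0.80)
print(f"projected F1 = {projected:.3f}")
```

This yields about 0.789, consistent with the ~0.78 figure above. (Note that the current average F1 of 0.681 is an average of per-experiment F1 scores, so it is slightly below the F1 implied by the averaged precision and recall.)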


11 Appendix

Database Connection:

  • Host: memini.lan
  • Port: 5433
  • Database: postgres
  • Report Generated: 2025-10-05 11:55:13 EDT

Analysis Parameters:

  • Total Experiments Analyzed: 46
  • Completed Experiments: 38
  • Total Narratives: 14,790
  • Date Range: 2025-10-03 to 2025-10-04


This report was automatically generated from PostgreSQL experimental data.
For questions or issues, contact the research team or review source tables: experiments and narrative_results.